README:

All the dependencies are listed in 'environment.yml'.

Run "conda env create --file environment.yml" for environment setup.

0. Prework before we start

This part includes all the prework done before answering the homework questions.

0.1 More about Data

If you are not a U.S. citizen, "TSA" is quite possibly an unfamiliar term (it was to me). Background research is the key to interpreting the data in the first place. Therefore, I will try to answer the questions below in this section:

  1. What is the TSA, and how does it work?
  2. What is the 'TSA Claims' data about, and how is it collected?

What is the TSA, and how does it work?

From Wikipedia, you can find the information below:

The Transportation Security Administration (TSA) is an agency of the U.S. Department of Homeland Security that has authority over the security of the traveling public in the United States.

The TSA has multiple screening processes and regulations related to passengers and carry-on luggage, including: identification requirements, pat-downs, full-body scanners, device restrictions, and explosives screening.

In a word: the TSA is the organization responsible for screening passengers and baggage at U.S. airports for security reasons.

Logo of the TSA

What is the 'TSA Claims' data about, and how is it collected?

You can find the description below on this page.

If you have experienced a loss or damage to your property and you feel that this loss or damage occurred as a direct result of negligence by a TSA employee, you may file a claim with TSA. If you feel the loss or damage was due to the negligence of your air carrier, please file a claim directly with the air carrier. If filing with TSA, you must include proof of your loss or damage as well as evidence of TSA negligence.

Simply speaking, as a passenger, if you experienced a loss or damage due to TSA screening, you can file a claim with the TSA for compensation.

The data is published at this official site.

I have listed some rows of the claims data below. Each row describes one claim: when and where it happened, what happened, and whether the claim was approved.

0.2 Data cleaning and table union

We have three claims files. Before unioning them, let's check whether any headers changed.

What we noticed:

  1. Missing data (Null values) decreases in the newer files, especially in the columns related to airport and airline.
  2. Headers changed in the file "TSA_Claims-2010-2013.xls".
    • Columns 'Claim Amount' and 'Status' are removed from the table.
    • The original 'Item' column changed to 'Item Category' with lower granularity.
  3. Field types are not aligned across files.
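
One way to check for header changes is to compare each file's columns against a baseline. Below is a minimal sketch using toy stand-in frames; the real code would `pd.read_excel` the three files, and the column sets here only mirror the changes noted above:

```python
import pandas as pd

# Stand-ins for the three claim files (illustrative, not the full schemas).
frames = {
    "2002-2006": pd.DataFrame(columns=["Claim Number", "Claim Amount", "Status", "Item"]),
    "2007-2009": pd.DataFrame(columns=["Claim Number", "Claim Amount", "Status", "Item"]),
    "2010-2013": pd.DataFrame(columns=["Claim Number", "Item Category"]),
}

# Compare each file's header against the oldest file.
base = set(frames["2002-2006"].columns)
diffs = {name: {"removed": sorted(base - set(df.columns)),
                "added": sorted(set(df.columns) - base)}
         for name, df in frames.items()}
print(diffs["2010-2013"])
```

Running this surfaces exactly the kind of drift listed above: dropped columns and the 'Item' → 'Item Category' rename.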

Let's also check the categorical columns before merging!

What we found:

  1. There is no significant gap between the 2002-2006 and 2007-2009 files.
  2. New 'Claim Type' values were added in the 2010-2013 data. (This is common, and no action is needed.)
  3. '-' was added in the 2010-2013 data to represent Null. We can replace it with Null.
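
The '-' placeholder and the misaligned field types can be fixed in one pass. A sketch on an illustrative slice (the column names here are assumptions, not the full schema):

```python
import numpy as np
import pandas as pd

# Illustrative slice of the 2010-2013 file, where "-" stands for a missing value.
df = pd.DataFrame({
    "Airport Code": ["JFK", "-", "LAX"],
    "Close Amount": ["100.00", "-", "25.50"],
})

# Replace the "-" placeholder with a real Null, then align the field type.
df = df.replace("-", np.nan)
df["Close Amount"] = pd.to_numeric(df["Close Amount"])
print(df.dtypes["Close Amount"], df["Airport Code"].isna().sum())
```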

Union the tables with some cleaning!

We can also join the airport-related tables. Joining every table into one wide result is generally not good practice, but the size of the dataset is fine here.
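
The union-then-join step might look like the sketch below, on toy tables; using 'Airport Code' as the join key is an assumption:

```python
import pandas as pd

# Toy stand-ins for the cleaned yearly claim tables and an airport lookup table.
claims_a = pd.DataFrame({"Claim Number": [1], "Airport Code": ["JFK"]})
claims_b = pd.DataFrame({"Claim Number": [2], "Airport Code": ["HNL"]})
airports = pd.DataFrame({"Airport Code": ["JFK", "HNL"],
                         "State": ["NY", "HI"]})

# Union the yearly tables, then join the airport attributes.
claims = pd.concat([claims_a, claims_b], ignore_index=True)
claims = claims.merge(airports, on="Airport Code", how="left")
print(claims)
```

A left join keeps every claim even when the airport lookup has no match, which matters given the Null-heavy airport columns noted earlier.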

0.3 Simple EDA before the start!

Parallel categories plots are used here for "storytelling" purposes! They show where each kind of incident happened most and how the claims ended.
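
The data behind a parallel categories plot is just the count of each (type, site, status) path. A pandas-only sketch on toy rows; the actual rendering (commented out) would use a library such as plotly:

```python
import pandas as pd

# Toy claims; the real plot was built from columns like these.
claims = pd.DataFrame({
    "Claim Type": ["Property Damage", "Property Loss", "Property Damage"],
    "Claim Site": ["Checkpoint", "Checked Baggage", "Checked Baggage"],
    "Status": ["Approved", "Denied", "Settled"],
})

# The count per (type, site, status) path is what each ribbon encodes.
paths = (claims.groupby(["Claim Type", "Claim Site", "Status"])
               .size().reset_index(name="count"))
print(paths)

# Rendering (requires plotly):
# import plotly.express as px
# px.parallel_categories(claims, dimensions=["Claim Type", "Claim Site", "Status"]).show()
```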

Some observations:

  1. "Property Damage" happens most at "Checkpoint" and "Checked Baggage", while "Property Loss" mainly occurs at "Checked Baggage".
  2. Claims about "Property Damage" are more likely to be approved or settled than "Property Loss."
  3. Most of the "Personal Injury" cases happened at "Checkpoint". Almost all litigation cases come from this type of claim.
  4. "Passenger Theft" happened most at "Checked Baggage".
  5. Claims about "Employee Loss" are more likely to be approved or settled.

A word cloud plot shows which items passengers have claimed.
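
A word cloud is driven by token frequencies, which can be computed without the plotting library. A stdlib sketch on a few made-up item strings:

```python
from collections import Counter

# Example "Item" strings; the real data has thousands of rows.
items = ["Clothing", "Jewelry - Necklace", "Clothing", "Locks", "Jewelry - Ring"]

# Token frequencies are the input a word cloud renders.
freq = Counter(word.strip().lower()
               for item in items
               for word in item.split("-"))
print(freq.most_common(2))

# Rendering (requires the `wordcloud` package):
# from wordcloud import WordCloud
# WordCloud().generate_from_frequencies(freq)
```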

See how claims are distributed on a map.

Q1. What was the average number of days between incidents and report dates?

We can observe some outliers in the data from the output below. Neither negative nor enormous values are possible here.

We should drop the outliers first.
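
The gap computation and the outlier filter can be sketched as follows; the column names and the 365-day cutoff are assumptions, not the notebook's exact choices:

```python
import pandas as pd

# Toy rows: one valid gap, one negative gap, one century-scale typo.
claims = pd.DataFrame({
    "Date Received": pd.to_datetime(["2004-02-10", "2004-01-05", "2004-03-01"]),
    "Incident Date": pd.to_datetime(["2004-01-20", "2004-01-10", "1904-03-01"]),
})

# Days between the incident and the report.
claims["gap_days"] = (claims["Date Received"] - claims["Incident Date"]).dt.days

# Drop impossible gaps: negative, or implausibly large.
clean = claims[(claims["gap_days"] >= 0) & (claims["gap_days"] <= 365)]
print(clean["gap_days"].mean(), clean["gap_days"].median())
```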

Q1 Answer: After dropping the outliers, the average number of days is around 27, and the median is 21.

Q1 Discussion: Discuss any kind of seasonality or grouping you see in this data.

What we got from the visualization below:

  1. The year matters: the average date difference kept decreasing until 2006. We could assume that the TSA was optimizing its claim-processing pipeline and had fixed it by 2006.
  2. We could not find strong seasonality across months or weekdays. Summer seems to have a slightly shorter date difference, but the gap is small.
  3. Claims of type "Personal Injury" and "Motor Vehicle" have longer gaps, which also leads to a longer gap for the claim site "Motor Vehicle".
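
The year/month grouping behind this seasonality check can be sketched like this (toy data, assumed column names):

```python
import pandas as pd

claims = pd.DataFrame({
    "Incident Date": pd.to_datetime(["2003-07-01", "2003-07-15", "2007-01-10"]),
    "gap_days": [40, 30, 20],
})

# Average reporting gap by year and by month.
by_year = claims.groupby(claims["Incident Date"].dt.year)["gap_days"].mean()
by_month = claims.groupby(claims["Incident Date"].dt.month)["gap_days"].mean()
print(by_year.to_dict(), by_month.to_dict())
```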

Q2. Which state receives the highest average claim value?

We noticed some extreme values in the column "Claim Amount." They mainly come from the claim type of "Personal Injury". Let's choose the statistic carefully!

Q2 Answer:

Solution 1: Result without any processing. (Attention: The number could be misleading due to outliers!)

Solution 2: Drop the outliers beyond 1.5 IQR.

Solution 3: Use the median. Percentiles are usually more robust than the average. (The result aligns perfectly with the second solution.)

Solution 4: Take the mean of the approved "Close Amount". (This could be the best statistic if your purpose is to know the average "reasonable claim value.")
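
The four solutions can be sketched side by side on toy data with one absurd outlier; the column names follow the text, but the exact grouping is an assumption:

```python
import pandas as pd

claims = pd.DataFrame({
    "State": ["NY"] * 5,
    "Claim Amount": [100.0, 120.0, 80.0, 110.0, 3e12],   # one absurd outlier
    "Close Amount": [90.0, 100.0, 0.0, 95.0, 0.0],
    "Status": ["Approved", "Approved", "Denied", "Settled", "Denied"],
})

g = claims.groupby("State")["Claim Amount"]
naive_mean = g.mean()                       # solution 1: dominated by the outlier

q1, q3 = claims["Claim Amount"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = claims[claims["Claim Amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
robust_mean = inliers.groupby("State")["Claim Amount"].mean()   # solution 2

median = g.median()                         # solution 3

approved = claims[claims["Status"].isin(["Approved", "Settled"])]
close_mean = approved.groupby("State")["Close Amount"].mean()   # solution 4
print(naive_mean["NY"], robust_mean["NY"], median["NY"], close_mean["NY"])
```

Even on five rows, solution 1 is blown up by the outlier while solutions 2-4 stay in a sensible range.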

We can also visualize the result (I chose the 4th solution) geospatially.

Q2 Discussion: Discuss these results with any caveats.

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses. -- Wikipedia "Outlier"

The average is a statistic that can easily be distorted by outliers. In this case, we noticed a data point with a claim amount of three trillion USD. (Japan's GDP in 2019 was 5 trillion USD.) The "Claim Amount" value is entered manually as an arbitrary number, so this enormous number could be a "correct input." The mean we computed (solution 1) is also technically correct, but from an analysis point of view it is a meaningless and misleading number.

Analyzing data in a top-down manner can be helpful: clarify your question, then choose the statistic with enough prior knowledge. (The picture below explains these two approaches well. Source)

Q3. What are the top three categories (by volume) of claims filed in New York City?

Q3 Discussion: Discuss these results.

  1. New York is the city with the highest count of claims. (This is natural, considering New York is the largest U.S. city by population.)
  2. "Property Loss" made up a higher percentage in New York, while less "Property Damage" happened.
  3. From the parallel plot, we noticed New York has a higher rejection rate compared to the average.
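
Counting categories by volume is a one-liner once the city's rows are filtered. A sketch with toy data; filtering New York City by airport codes such as JFK/LGA is an assumption about the approach:

```python
import pandas as pd

claims = pd.DataFrame({
    "Airport Code": ["JFK", "LGA", "JFK", "JFK", "LGA", "JFK"],
    "Item Category": ["Clothing", "Jewelry", "Clothing", "Locks", "Clothing", "Jewelry"],
})

# Keep only New York City airports, then count categories by volume.
nyc = claims[claims["Airport Code"].isin(["JFK", "LGA"])]
top3 = nyc["Item Category"].value_counts().head(3)
print(top3.to_dict())
```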

Q4. What’s the peak month for claims in the Pacific timezone? For approved claims?

Answer: We found the peak month for claims was January for both total claims and approved claims.

To visualise trends, we used two different approaches:

  1. A monthly bar plot, which represents the trend discretely.
  2. A daily line plot with a 30-day moving window.

Both plots describe the peak period for the claims in the Pacific timezone, which is between Jan and May.
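
The 30-day moving window in the second approach can be sketched with pandas `rolling` on toy daily counts:

```python
import pandas as pd

# Toy daily claim counts; the real series covered Pacific-timezone airports.
counts = pd.Series(
    [3, 1, 4, 2, 5],
    index=pd.date_range("2004-01-01", periods=5, freq="D"),
)

# A 30-day moving window smooths daily noise (min_periods keeps the edges).
smoothed = counts.rolling(window=30, min_periods=1).mean()
print(smoothed.iloc[-1])
```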

The peak period is directly linked to Hawaii's travel season. The peak tourism season starts in the middle of December, but the incidents happen on the return trip, which shifts the claims peak to early January.

The peak tourism season in Hawaii typically starts in the middle of December and continues until the end of March or mid-April. The off-season stretches from the middle of April and continues until mid-June, and resumes again from September until crowds tick up before the holidays.

Data source: hawaii.gov

Q5. Which airline has the highest rejection rate for claims?

If we only look at the "rejection rate" estimated from historical data, we may notice that the airline with the highest rate has a small sample size.

We can either:

  1. Set a minimum #claims threshold empirically or use a sample size calculator.

  2. Decide it statistically.

    • Each airline's claims are treated as draws from a binomial distribution.
    • We can conduct a two-sample proportions test to check whether the sample size is large enough to say an airline has the highest "rejection rate" or not.
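
The two-sample proportions z-test can be written from scratch with the standard library; the counts below are hypothetical, not from the data:

```python
from math import erfc, sqrt

def two_sample_proportions_z(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))          # two-sided normal tail
    return z, p_value

# Hypothetical counts: a tiny airline with 9/10 rejections vs. a large
# airline with 580/1000.
z, p = two_sample_proportions_z(9, 10, 580, 1000)
print(round(z, 2), round(p, 3))
```

In practice a library routine such as statsmodels' `proportions_ztest` does the same job; the point is that a huge observed rate on ten claims may still fail to separate from a well-estimated rate on a thousand.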

Q5 Answer: "Jet Blue" has the largest #claims, with no significant difference in rejection rate compared to the previous one.

The rejection rate for "Jet Blue" is 58.1%. (p-value = 0.05)

Q5 Discussion: Visualize how this has changed over time for that airline.

Q6. Explore the data any way you like and discuss your findings.

I have put my EDA process in "0.3 Simple EDA before the start!". In this section, I will:

  1. Do feature engineering for ML prediction.
  2. Train classifiers with Random Forest and LightGBM.
  3. Visualize the feature importance and SHAP values for better understanding.
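
Step 1, the feature engineering, might look like the pandas-only sketch below; the column names and the multi-item count feature are assumptions:

```python
import pandas as pd

claims = pd.DataFrame({
    "Claim Amount": [120.0, 45.0, 800.0],
    "Item": ["Clothing; Locks", "Eyeglasses", "Jewelry"],
    "Claim Site": ["Checked Baggage", "Checkpoint", "Checked Baggage"],
    "Status": ["Denied", "Approved", "Denied"],
})

# Feature engineering: item count per claim plus one-hot categoricals.
claims["n_items"] = claims["Item"].str.split(";").str.len()
X = pd.get_dummies(claims[["Claim Amount", "n_items", "Claim Site"]],
                   columns=["Claim Site"])
y = (claims["Status"] == "Approved").astype(int)   # binary target
print(X.columns.tolist(), y.tolist())

# X and y would then feed RandomForestClassifier / LGBMClassifier.
```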

Q6: Summary

Both the Random Forest and LightGBM models achieve an AUC of around 0.70 on the classification task. The current performance is acceptable, but we could improve it with some parameter tuning.

From the feature importance and SHAP value, we got:

  1. Categorical features are essential for prediction, especially the airport where the screening took place.
  2. A lower claim amount positively impacts the approval of the claim.
  3. If the claim contains multiple items, it is more likely to be rejected.
  4. Some items in claims are more likely to be approved (e.g., Clothing, Eyeglasses), while others are more likely to be rejected (e.g., Locks, Jewelry).